OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization

Sun, Yiyou, Hu, Shawn, Zhou, Georgia, Zheng, Ken, Hajishirzi, Hannaneh, Dziri, Nouha, Song, Dawn

arXiv.org Artificial Intelligence

Recent large-scale language models (LLMs) with long Chain-of-Thought reasoning, such as DeepSeek-R1, have achieved impressive results on Olympiad-level mathematics benchmarks. However, they often rely on a narrow set of strategies and struggle with problems that require a novel way of thinking. To systematically investigate these limitations, we introduce OMEGA (Out-of-distribution Math Problems Evaluation with 3 Generalization Axes), a controlled yet diverse benchmark designed to evaluate three axes of out-of-distribution generalization, inspired by Boden's typology of creativity: (1) Exploratory: applying known problem-solving skills to more complex instances within the same problem domain; (2) Compositional: combining distinct reasoning skills, previously learned in isolation, to solve novel problems that require integrating these skills in new and coherent ways; and (3) Transformative: adopting novel, often unconventional strategies by moving beyond familiar approaches to solve problems more effectively. OMEGA consists of programmatically generated training-test pairs derived from templated problem generators across geometry, number theory, algebra, combinatorics, logic, and puzzles, with solutions verified using symbolic, numerical, or graphical methods. We evaluate frontier (or top-tier) LLMs and observe sharp performance degradation as problem complexity increases. Moreover, we fine-tune the Qwen-series models across all generalization settings and observe notable improvements in exploratory generalization, while compositional generalization remains limited and transformative reasoning shows little to no improvement. By isolating and quantifying these fine-grained failures, OMEGA lays the groundwork for advancing LLMs toward genuine mathematical creativity beyond mechanical proficiency.
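To make the "templated problem generators with verified solutions" idea concrete, here is a minimal sketch of a generator for one simple algebra family with a symbolic verifier and a difficulty knob. The template, difficulty scaling, and function names are illustrative assumptions, not taken from the OMEGA release.

```python
# Illustrative sketch (not the OMEGA code): a templated generator for a simple
# algebra family, with a symbolic verifier and a difficulty knob that controls
# how large the generated coefficients are.
import random
import sympy as sp

def generate_linear_problem(difficulty, seed=None):
    """Generate 'solve a*x + b = c for x' with coefficient size tied to difficulty."""
    rng = random.Random(seed)
    bound = 10 ** difficulty                  # larger difficulty -> larger coefficients
    a = rng.randint(1, bound)
    b = rng.randint(-bound, bound)
    c = rng.randint(-bound, bound)
    question = f"Solve for x: {a}*x + {b} = {c}"
    answer = sp.Rational(c - b, a)            # ground-truth solution kept symbolically
    return question, answer

def verify(candidate, answer):
    """Symbolically check a model's candidate answer against the ground truth."""
    try:
        return sp.simplify(sp.sympify(candidate) - answer) == 0
    except (sp.SympifyError, TypeError):
        return False

if __name__ == "__main__":
    q, ans = generate_linear_problem(difficulty=2, seed=0)
    print(q)
    print(verify(str(ans), ans))   # True: the reference answer verifies against itself
```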


Complex LLM Planning via Automated Heuristics Discovery

Ling, Hongyi, Parashar, Shubham, Khurana, Sambhav, Olson, Blake, Basu, Anwesha, Sinha, Gaurangi, Tu, Zhengzhong, Caverlee, James, Ji, Shuiwang

arXiv.org Artificial Intelligence

We consider enhancing large language models (LLMs) for complex planning tasks. While existing methods allow LLMs to explore intermediate steps to make plans, they either depend on unreliable self-verification or external verifiers to evaluate these steps, which demand significant data and computations. Here, we propose automated heuristics discovery (AutoHD), a novel approach that enables LLMs to explicitly generate heuristic functions to guide inference-time search, allowing accurate evaluation of intermediate states. These heuristic functions are further refined through a heuristic evolution process, improving their robustness and effectiveness. Our proposed method requires no additional model training or fine-tuning, and the explicit definition of heuristic functions generated by the LLMs provides interpretability and insights into the reasoning process. Extensive experiments across diverse benchmarks demonstrate significant gains over multiple baselines, including nearly twice the accuracy on some datasets, establishing our approach as a reliable and interpretable solution for complex planning tasks.
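A compressed sketch of the core loop AutoHD describes: a heuristic function, proposed by the LLM as code, scores intermediate states during inference-time search. The toy planning domain, the `propose_heuristic` stub, and the search routine below are placeholders for illustration, not the paper's prompts or benchmarks.

```python
# Illustrative sketch of heuristic-guided search with an LLM-proposed heuristic.
# `propose_heuristic` stands in for the LLM call; here it returns a hand-written
# heuristic so the example is runnable without a model.
import heapq

def propose_heuristic():
    """Placeholder for the LLM step that emits a heuristic function as code."""
    def h(state, goal):
        # toy heuristic for a number-reaching puzzle: distance to the goal value
        return abs(goal - state)
    return h

def successors(state):
    """Toy planning domain: from n you may move to n + 1 or n * 2."""
    return [state + 1, state * 2]

def best_first_search(start, goal, heuristic, max_expansions=10_000):
    frontier = [(heuristic(start, goal), start, [start])]
    seen = {start}
    for _ in range(max_expansions):
        if not frontier:
            break
        _, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        for nxt in successors(state):
            if nxt not in seen and nxt <= goal * 2:
                seen.add(nxt)
                heapq.heappush(frontier, (heuristic(nxt, goal), nxt, path + [nxt]))
    return None

if __name__ == "__main__":
    h = propose_heuristic()
    print(best_first_search(start=1, goal=37, heuristic=h))
```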


SBSC: Step-By-Step Coding for Improving Mathematical Olympiad Performance

Singh, Kunal, Biswas, Ankan, Bhowmick, Sayandeep, Moturi, Pradeep, Gollapalli, Siva Kishore

arXiv.org Artificial Intelligence

We propose Step-by-Step Coding (SBSC): a multi-turn math reasoning framework that enables Large Language Models (LLMs) to generate a sequence of programs for solving Olympiad-level math problems. At each step/turn, by leveraging the code execution outputs and programs of previous steps, the model generates the next sub-task and the corresponding program to solve it. In this way, SBSC sequentially navigates to the final answer. SBSC allows a more granular, flexible, and precise approach to problem-solving compared to existing methods. Extensive experiments highlight the effectiveness of SBSC in tackling competition- and Olympiad-level math problems. For Claude-3.5-Sonnet, we observe that SBSC (greedy decoding) surpasses existing state-of-the-art (SOTA) program-generation-based reasoning strategies by an absolute 10.7% on AMC12, 8% on AIME, and 12.6% on MathOdyssey. Given that SBSC is multi-turn in nature, we also benchmark SBSC's greedy decoding against self-consistency decoding results of existing SOTA math reasoning strategies and observe performance gains of an absolute 6.2% on AMC, 6.7% on AIME, and 7.4% on MathOdyssey.
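A minimal sketch of the multi-turn loop SBSC describes: at each turn a model proposes the next sub-task and a program, the program is executed, and the output is appended to the transcript for the next turn. The `ask_model` stub, termination flag, and toy sub-tasks are assumptions for illustration, not the paper's actual prompts.

```python
# Illustrative sketch of a step-by-step coding loop. `ask_model` is a stub for
# the LLM call; it would normally return the next sub-task and a code snippet
# conditioned on the transcript so far.
import io
import contextlib

def ask_model(transcript):
    """Placeholder LLM: returns (sub_task, code, done). Replace with a real model call."""
    if "step 1" not in transcript:
        return "step 1: compute 2**10", "print(2**10)", False
    return "final: report the answer", "print('ANSWER: 1024')", True

def run_snippet(code):
    """Execute a generated snippet and capture its stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})          # in practice this should run in a sandbox
    return buf.getvalue().strip()

def solve(max_turns=8):
    transcript = ""
    for _ in range(max_turns):
        sub_task, code, done = ask_model(transcript)
        output = run_snippet(code)
        transcript += f"{sub_task}\n{code}\n# output: {output}\n"
        if done:
            return output, transcript
    return None, transcript

if __name__ == "__main__":
    answer, log = solve()
    print(answer)
```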


BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

Zhang, Beichen, Liu, Yuhong, Dong, Xiaoyi, Zang, Yuhang, Zhang, Pan, Duan, Haodong, Cao, Yuhang, Lin, Dahua, Wang, Jiaqi

arXiv.org Artificial Intelligence

Cutting-edge large language models (LLMs) demonstrate promising performance in solving complex math problems with a divide-and-conquer pipeline and the assistance of in-context learning (ICL) examples. However, their potential for improvement is limited by two critical problems within their ICL examples: granularity mismatch and the ensuing negative-effect noise. Specifically, LLMs handle the dividing process well yet often fail due to inaccurate reasoning within a few conquer steps, while ICL examples retrieved at question granularity sometimes lack the relevant steps for a specific challenging reasoning step; this mismatch can introduce irrelevant context that hinders correct reasoning. To this end, we focus on improving the reasoning quality within each step and present BoostStep. BoostStep aligns retrieval and reasoning at step granularity, and provides highly related ICL examples for each reasoning step with a novel "first-try" strategy. BoostStep provides more relevant examples than the coarse question-grained strategy, steadily enhancing the model's reasoning quality within each step. BoostStep is a general and robust reasoning-enhancing method that not only improves standalone reasoning performance but also integrates seamlessly with Monte Carlo Tree Search (MCTS) to refine both candidate generation and decision-making. Quantitatively, it improves GPT-4o and Qwen2.5-Math-72B by 3.6% and 2.0% respectively on various mathematical benchmarks, with a 7.5% gain when combined with MCTS.
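A rough sketch of the step-grained retrieval idea: a draft ("first-try") step is used as the query to fetch the most similar step-level example, which then guides regeneration of that step. The token-overlap similarity, example bank, and model stub below are simplified stand-ins, not BoostStep's actual retriever or prompts.

```python
# Illustrative sketch of step-grained example retrieval with a "first-try" query.
# The example bank, similarity function, and model stub are all simplified.

STEP_BANK = [
    "Apply the quadratic formula x = (-b +/- sqrt(b^2 - 4ac)) / (2a).",
    "Factor the difference of squares a^2 - b^2 = (a - b)(a + b).",
    "Use the triangle inequality to bound the sum of side lengths.",
]

def token_overlap(a, b):
    """Crude similarity: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def retrieve_step_example(first_try):
    """Pick the bank step most similar to the model's first-try draft."""
    return max(STEP_BANK, key=lambda ex: token_overlap(first_try, ex))

def generate_step(context, example=None):
    """Placeholder for LLM step generation; a real system would call a model."""
    hint = f" (guided by: {example})" if example else ""
    return f"next step for '{context}'{hint}"

def boosted_step(context):
    draft = generate_step(context)                 # first try, no example
    example = retrieve_step_example(draft)         # step-grained retrieval
    return generate_step(context, example=example) # regenerate with the example

if __name__ == "__main__":
    print(boosted_step("solve x^2 - 5x + 6 = 0"))
```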


Give me a hint: Can LLMs take a hint to solve math problems?

Agrawal, Vansh, Singla, Pratham, Miglani, Amitoj Singh, Garg, Shivank, Mangal, Ayush

arXiv.org Artificial Intelligence

While state-of-the-art LLMs have shown poor logical and basic mathematical reasoning, recent works try to improve their problem-solving abilities using prompting techniques. We propose giving "hints" to improve the language model's performance on advanced mathematical problems, taking inspiration from how humans approach math pedagogically. We also test robustness to adversarial hints and demonstrate the models' sensitivity to them. We demonstrate the effectiveness of our approach by evaluating diverse LLMs, presenting them with a broad set of problems of different difficulties and topics from the MATH dataset and comparing against techniques such as one-shot, few-shot, and chain-of-thought prompting.
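A small sketch of the hint-augmented prompting setup the abstract describes, contrasted with a plain zero-shot prompt; the template wording is an assumption for illustration, not the paper's exact prompts.

```python
# Illustrative prompt construction for hint-based evaluation. The templates are
# hypothetical; a real experiment would send these strings to an LLM API.

def plain_prompt(problem):
    return f"Solve the following problem and give the final answer.\n\nProblem: {problem}"

def hinted_prompt(problem, hint):
    return (
        "Solve the following problem. A hint is provided; use it if helpful.\n\n"
        f"Problem: {problem}\nHint: {hint}\n"
    )

if __name__ == "__main__":
    problem = "Find the remainder when 2^10 is divided by 7."
    print(plain_prompt(problem))
    print(hinted_prompt(problem, "The powers of 2 modulo 7 repeat with period 3."))
```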


Let's Be Self-generated via Step by Step: A Curriculum Learning Approach to Automated Reasoning with Large Language Models

Luo, Kangyang, Ding, Zichen, Weng, Zhenmin, Qiao, Lingfeng, Zhao, Meng, Li, Xiang, Yin, Di, Shu, Jinlong

arXiv.org Artificial Intelligence

Prior to our efforts, there has already been work striving toward this goal. For example, Self-ICL (Chen et al., 2023) begins by prompting the LLM to generate a few new, diverse, and creative proxy queries tailored to the target task, and then solves each of them independently in a zero-shot chain-of-thought (ZS-CoT) manner, which in turn yields proxy exemplars for prompting the LLM to engage in reasoning. Auto-ICL (Yang et al., 2023) operates similarly to Self-ICL, but differs in that it instructs the LLM to produce proxy queries with the same structure as the given query. Analogical Prompting (Yasunaga et al., 2023) draws on the cognitive process of solving new problems from relevant past experiences, i.e., analogical reasoning, and prompts the language model to self-generate relevant examples in context before embarking on the solution of a given query. Notably, the one-pass generation mode employed in Analogical Prompting necessitates that the LLM possess robust capabilities for both following instructions and generating responses. We revisit the aforementioned approaches and discern that their efficacy hinges on guiding the LLM to recall experiences relevant to the given query. However, relying solely on such experiences may lead to proxy queries that are as challenging as the given query, along with corresponding erroneous proxy solutions, potentially misleading the solution of the original query.


NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts

Zhang, Shudan, Zhao, Hanlin, Liu, Xiao, Zheng, Qinkai, Qi, Zehan, Gu, Xiaotao, Zhang, Xiaohan, Dong, Yuxiao, Tang, Jie

arXiv.org Artificial Intelligence

Large language models (LLMs) have manifested a strong ability to generate code for productive activities. However, current benchmarks for code synthesis, such as HumanEval, MBPP, and DS-1000, are predominantly oriented towards introductory tasks in algorithms and data science, insufficiently covering the challenging requirements prevalent in real-world coding. To fill this gap, we propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks. NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries on online coding services and covering 6 different domains. Noting the extraordinary difficulty of creating test cases for real-world queries, we also introduce a semi-automated pipeline to enhance the efficiency of test case construction. Compared with manual construction, it achieves an efficiency increase of more than 4 times. Our systematic experiments on 39 LLMs find that performance gaps on NCB between models with close HumanEval scores can still be significant, indicating a lack of focus on practical code synthesis scenarios or over-specified optimization on HumanEval. On the other hand, even the best-performing GPT-4 is still far from satisfactory on NCB. The evaluation toolkit and development set are available at https://github.com/THUDM/NaturalCodeBench.


Tutorial and Practice in Linear Programming: Optimization Problems in Supply Chain and Transport Logistics

Bridgelall, Raj

arXiv.org Artificial Intelligence

This tutorial is an andragogical guide for students and practitioners seeking to understand the fundamentals and practice of linear programming. The exercises demonstrate how to solve classical optimization problems, with an emphasis on spatial analysis in supply chain management and transport logistics. All exercises display the Python programs and optimization libraries used to solve them. The first chapter introduces key concepts in linear programming and contributes a new cognitive framework to help students and practitioners set up each optimization problem. The cognitive framework organizes the decision variables, constraints, the objective function, and variable bounds in a format for direct application to optimization software. The second chapter introduces two types of mobility optimization problems (shortest path in a network and minimum cost tour) in the context of delivery and service planning logistics. The third chapter introduces four types of spatial optimization problems (neighborhood coverage, flow capturing, zone heterogeneity, service coverage) and contributes a workflow to visualize the optimized solutions in maps. The workflow creates decision variables from maps using the free geographic information systems (GIS) programs QGIS and GeoDA. The fourth chapter introduces three types of spatial logistical problems (spatial distribution, flow maximization, warehouse location optimization) and demonstrates how to scale the cognitive framework in software to reach solutions. The final chapter summarizes lessons learned and provides insights into how students and practitioners can modify the Python programs and GIS workflows to solve their own optimization problems and visualize the results.
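To make the setup concrete, here is a small linear program in the spirit of the tutorial's supply-chain examples, solved with scipy.optimize.linprog; the data and variable names are invented for illustration and are not taken from the tutorial's exercises.

```python
# Illustrative transportation LP (invented data): two warehouses supply two
# stores at different unit shipping costs; minimize total shipping cost.
from scipy.optimize import linprog

# Decision variables: x = [x11, x12, x21, x22], units shipped warehouse i -> store j.
cost = [2, 3, 4, 1]                       # objective coefficients (unit costs)

A_ub = [[1, 1, 0, 0],                     # warehouse 1 ships at most 20 units
        [0, 0, 1, 1]]                     # warehouse 2 ships at most 30 units
b_ub = [20, 30]

A_eq = [[1, 0, 1, 0],                     # store 1 demands exactly 25 units
        [0, 1, 0, 1]]                     # store 2 demands exactly 25 units
b_eq = [25, 25]

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * 4, method="highs")
print(res.x)        # optimal shipment plan
print(res.fun)      # minimum total shipping cost (85.0 with this data)
```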


DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

Lai, Yuhang, Li, Chengxi, Wang, Yiming, Zhang, Tianyi, Zhong, Ruiqi, Zettlemoyer, Luke, Yih, Scott Wen-tau, Fried, Daniel, Wang, Sida, Yu, Tao

arXiv.org Artificial Intelligence

We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior work, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases, since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable): across all Codex-002-predicted solutions that our evaluation accepts, only 1.8% are incorrect; we achieve this with multi-criteria metrics, checking both functional correctness by running test cases and surface-form constraints by restricting API usages or keywords. Finally, we proactively defend against memorization by slightly modifying our problems so they differ from the original StackOverflow source; consequently, models cannot answer them correctly by memorizing solutions from pre-training. The current best public system (Codex-002) achieves 43.3% accuracy, leaving ample room for improvement. We release our benchmark at https://ds1000-code-gen.github.io.
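A rough sketch of the two-part check described above, pairing a functional test with a surface-form constraint on which APIs the solution may use; the problem, test, and keyword lists are made up and far simpler than DS-1000's actual evaluation harness.

```python
# Illustrative two-criteria check: (1) run a functional test on the candidate,
# (2) enforce a surface-form constraint on the code text (required/banned tokens).
import numpy as np

def functional_check(candidate_code, test_input, expected):
    """Execute the candidate and compare its result against the reference output."""
    namespace = {"np": np}
    exec(candidate_code, namespace)            # defines `solution` in the namespace
    return np.array_equal(namespace["solution"](test_input), expected)

def surface_form_check(candidate_code, required=("np.cumsum",), banned=("for ",)):
    """Require certain API calls and forbid others (here: no explicit Python loop)."""
    return all(tok in candidate_code for tok in required) and \
           not any(tok in candidate_code for tok in banned)

candidate = """
def solution(a):
    return np.cumsum(a)
"""

if __name__ == "__main__":
    ok_func = functional_check(candidate, np.array([1, 2, 3]), np.array([1, 3, 6]))
    ok_form = surface_form_check(candidate)
    print(ok_func and ok_form)    # True only if both criteria pass
```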


A Tour of Machine Learning Algorithms

#artificialintelligence

In this post, we will take a tour of the most popular machine learning algorithms. It is useful to tour the main algorithms in the field to get a feeling for what methods are available. There are so many algorithms that it can feel overwhelming when algorithm names are thrown around and you are expected to just know what they are and where they fit. I want to give you two ways to think about and categorize the algorithms you may come across in the field. Both approaches are useful, but we will focus on the grouping of algorithms by similarity and go on a tour of a variety of different algorithm types.